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Abstract 

This paper presents a novel population genetic model and a computationally and statistically tractable framework for 
analyzing within-host HIV diversity based on serial samples of HIV DNA sequences. This model considers within-host HIV 
evolution during the chronic phase of infection and assumes that the HIV population is homogeneous at the beginning, 
corresponding to the time of seroconversion, and evolves according to the Wright-Fisher reproduction model with 
recombination and variable mutation rate across nucleotide sites. In addition, the population size and generation time vary 
over time as piecewise constant functions of time. Under this model I approximate the genealogical and mutational 
processes for serial samples of DNA sequences by a continuous coalescent-recombination process and an inhomogeneous 
Poisson process, respectively. Based on these derivations, an efficient algorithm is described for generating polymorphisms 
in serial samples of DNA sequences under the model including various substitution models. Extensions of the algorithm are 
also described for other demographic scenarios that can be more suitable for analyzing the dynamics of genetic diversity of 
other pathogens in vitro and in vivo. For the case of the infinite-sites model, I derive analytical formulas for the expected 
number of polymorphic sites in sample of DNA sequences, and apply the developed simulation and analytical methods to 
explore the fit of the model to HIV genetic diversity based on serial samples of HIV DNA sequences from 9 HIV-infected 
individuals. The results particularly show that the estimates of the ratio of recombination rate over mutation rate can vary 
over time between very high and low values, which can be considered as a consequence of the impact of selection forces. 
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Introduction 

Recombination has an important role in shaping the dynamics 
of within-host HIV genetic diversity, particularly making the virus 
capable of escaping the pressures of antiviral drugs and immune 
system [1,2]. Therefore, it is of great interest to model this process 
at the within-host HIV population genomic level. In contrast to 
the existing methods in the literature, this paper develops a novel 
population genetic model including recombination and a compu- 
tationally and statistically tractable framework for this model to 
explore the dynamics of within-host HIV genetic diversity based 
on serial samples of HIV DNA sequences. 

Early studies [3-8] analyzed serial samples of HIV DNA 
sequences from HIV- 1 -infected individuals by developing frame- 
works based on statistically and computationally tractable models, 
which, however, neglect the impact of recombination or show less 
mimicking power for the dynamics of within-host HIV genetic 
diversity. Such an example is the standard coalescent model [9- 
13] that represents a continuous approximation for genealogical 
and mutational processes for samples of DNA sequences under the 
Wright-Fisher model with constant population size but without 
recombination. Another example is the ancestral recombination 
graph [14,15] that extends the standard coalescent by including 
recombination. Shriner et al [16] used various computational 
tools [17-19] based on this model to infer within-host HIV 



recombination rate. Although both models are attractive because 
of computational and statistical tractabUity, the expected dynamics 
of genetic diversity in serial samples of DNA sequences under these 
models are inconsistent with observed dynamics of within-host 
HIV genetic diversity. 

Under these models the expected numbers of average pairwise 
differences in serial samples stay the same [20] and independent 
on recombination rate [15]. This is a consequence of a more 
general fact that samples of DNA sequences at different time 
points under these models are not affected by temporal factor 
because the distributions of the polymorphisms in equal-size 
temporal samples are the same due to the same genealogical and 
mutational processes. In contrast to these expectations, Shankar- 
appa et al [2 1] observed that the divergences and average numbers 
of pairwise differences in serial samples of HIV DNA sequences 
from 9 HIV- 1 -infected individuals increased linearly for several 
years after seroconversion but declined or stabilized late in the 
infection (see Figures 1 and 2 by [21]). 

From the same data sets, I also observe linear relationships 
between the dynamics of the numbers of polymorphic sites and the 
numbers of average of pairwise differences in each individual's 
case (Figure 1). To make the linear relationships more obvious, I 
use the following normalization because two data sets are linearly 
related if and only if their normalizations are the same. Let 
{xi},i=l, . . . ,n, be the values of one of the statistics in serial 
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samples. If (t{x) is not equal to 0, where (Jix) = 

Y^"^ 1 (Xi — 5cf/n and x= ^^"^1 Xj/n, then the normalized 
values are 

{ixi-x)/aix)},i=l,...,n, (1) 

otherwise they are set to be 0. Figure 2 shows the dynamics of the 
normalized values of the two statistics from the observed samples. 

To illustrate non-linear relationship between the expected 
dynamics of the two statistics for the observed sample sizes under 
the Wright-Fisher model with constant population size, I compute 
the expected numbers of polymorphic sites under the finite-sites 
Jukes-Cantor model as well as under the infinite-sites model by 
using the formulas of Tajima [22] and Watterson [23], respec- 
tively. Figure 1 shows the expected numbers of polymorphic sites 
for the observed sample sizes under both mutational models. The 
normahzed values of the expected numbers of polymorphic sites 
are in Figure 2, which makes obvious the non-linear relationship 
between the expected dynamics of the two statistics due to the 
differences in the sample sizes. 

While the standard coalescent as well as the ancestral 
recombination graph might not be direcdy applicable for 
analyzing the dynamics of HIV genetic diversity within a host, 
the concepts and features of these models had and have great 
impact on extending coalescent theory for other evolutionary 
settings. Both models were derived as continuous limits of discreet 
genealogical and mutational processes under the Wright-Fisher 
models with constant population size and with and without 
recombination by applying the time scaling concept that is 
measuring time proportional to a very large population size. This 
concept was also applied for other forward in time Wright-Fisher 
models with variable population size, selection, or migration but 
without recombination (see e.g. [15,24,25]) to derive continuous 
coalescent models. An attractive feature of continuous approxi- 
mations is that the models are computationally and statistically 
tractable for analyzing samples of DNA sequences by efficientiy 
generating samples of DNA sequences and combining or 
contrasting the generated data sets with observed data. 

Later studies [26—30] combined the variable-population-size 
coalescent model [31] with phylogenetic methods to derive 
frameworks for analyzing serial samples of DNA sequences. 
However, these methods ignore recombination, and that can be a 
significant limitation, particularly, for analyzing serial samples of 
HIV DNA sequences because HIV genome is strongly affected by 
recombination [32]. Furthermore, Schierup and Hein [33] 
showed that the standard phylogen(;tic methods ignoring recom- 
bination can result in misleading inference. Other tools for 
analyzing serial samples under the variable population size model 
were also developed by Anderson et al [34] and Jakobsson [35] by 
using the discrete approximation method of Excoffier et al [36]. In 
this m(tth()d the' g('nc'al()g\' of scquc'nces are (xmstructed by tracing 
their lineages generation-by-generation back in time, and this gives 
an advantage of easily adapting this approach for other 
evolutionary scenarios including recombination. However, this 
method is computationally less efficient in comparison to the 
continuous approximation based methods in which the genealo- 
gies of sequences are constructed by tracing consecutive coalescent 
and recombination events back in time. 

To overcome the limitations of the previous models and 
methods mentioned above, I first describe a forward in time 
population genetic model to represent HIV evolution in HIV- 
infected individuals in the chronic phase of infection. The 
population in the model is considered to be homogeneous at the 



beginning, representing the time of HIV seroconversion, and to 
evolve according to the Wright-Fisher reproduction model with 
recombination by allowing the population size and generation 
time to vary over time. To make this model computationally and 
statistically tractable for analyzing serial samples of DNA 
sequences, I apply the time scaling approach at multiple time 
intervals (instead of a single time interval as in previous methods) 
to describe a continuous coalescent-recombination process for 
tracing the lineages of the samples back in time and superimposing 
mutational events on the lineages according to an inhomogeneous 
Poisson process. Based on these processes I describe computa- 
tionally efficient algorithm for generating polymorphisms in serial 
samples of DNA sequences drawn randomly under this population 
genetic model. Further extensions of the algorithm are also 
described for population genetic models that can be more suitable 
for analyzing the dynamics of genetic diversity of other pathogen 
populations in vivo and in vitro. 

Within this framework I consider two substitution models: a 
finite-sites model with variable mutation rate across nucleotide 
sites and the infinite-sites model. For the infinite-sites case, 1 derive 
analytical formulas for the expected number of polymorphic sites 
in samples of DNA sequences. For this quantity, Tajima [37] also 
derived an analytical expression by using a different approach. 
Thus, the developed simulation and analytical methods I apply to 
serial samples of HIV DNA sequences from 9 HIV-infected 
individuals [21] to explore the fit (the mimicking power) of the 
model to the data sets and to identify and quantify the signatures 
of recombination and selection on the dynamics of within-host 
HIV genetic diversity. Within this analysis I particularly explore 
the contrasting estimates for within-host HIV recombination rate 
by previous studies [16,38,39]. 

Methods 

The population genetic model 

To model within-host HIV evolution, I take into account the 
following observations: (1) HIV population within HIV-infected 
individuals usually collapses at seroconversion (the onset of the 
antiviral immune response) after several weeks of infection and 
recovers quickly as a homogeneous population [40-42]. (2) The 
viral load and CD4-h cell counts within HIV-infected individuals 
change over time, and I take this as an intuitive base for 
considering variability for within-host HIV generation time and 
population size. Thus, I consider a population model that is 
homogeneous at the beginning (representing the time of serocon- 
version) and evolves according to the neutral Wright-Fisher 
reproduction model with recombination, in which the population 
size and generation time vary over time but stay constant between 
consecutive sampling time points. 

In this model DNA sequences are represented as a combination 
of L consecutive loci, each of which consists / nucleotides. 
Recombination events along the lineages per sequence per 
generation occur with rate r and recombinations (crossovers) 
between two sequences are allowed only at the breakpoints 
between the consecutive loci. Mutation events per sequence per 
generation occur with rate fi, and the substitutions at the 
nucleotide sites occur according to a finite-sites model. Although 
various finite-sites substitution models [43-48] can be incorporat- 
ed within this model, 1 only consider the infinite-sites model and a 
finite-sites Jukes-Cantor model with variable mutation rate across 
nucleotide sites. Note that the infinite-sites model is a limiting case 
of the finite-sites model by considering the locus length / to be very 
large. 
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Figure 1. The observed and expected numbers of polymorphic sites and average pairwise differences in serial samples. The 

horizontal axis of each panel indicates sampling time since seroconversion. (A) The observed numbers of polymorphic sites, S„i{i), and average 
numbers of pairwise differences, n„,(/), in serial samples are plotted with respect to the sampling times; the data points determined by the two 
statistics are connected by blue and red lines, respectively. (B) and (C) show the expected numbers of polymorphic sites in serial samples under the 
Wright-Fisher model with constant population size combined with the finite-sites Jukes-Cantor model, as well as with the infinite-sites model, 
respectively. Under these substitution models, the expected values of this statistic for sample size «, at time T, are denoted by EjcS„.(i) and EiS„.(i), 
respectively. The expected average numbers of pairwise differences for the serial samples in each individual's case are not shown since they are the 
same for the samples. 
doi:10.1371/journal.pone.0087655.g001 
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Figure 2. Normalized observed and expected values of the two summary statistics. (A) The dynamics of the observed values of the two 
statistics in serial samples (Figure 1) are normalized based on transformation (1) and denoted by A'5'„,(/) and Nn„.{i), respectively. (B) and (C) show 
the normalized values of the expected numbers (Figure 1) of polymorphic sites in serial samples under the finite-site Jukes-Cantor model and the 
infinite-sites model denoted by NEjcS„^(i) and NE/S„^(i), respectively. The normalized values of the expected average numbers of pairwise 
differences in serial samples are equal to 0 and are not plotted. 
doi:1 0.1 371 /journal.pone.0087655.g002 
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Table 1 . The sizes of the groups of classified nucleotide sites 
based on the alignments of DNA sequences in serial samples 
for each individual's case. 



individual 


group 2' 


group 3'' 


Pti 


240 


70 


Pt2 


307 


155 


Pt3 


181 


59 


Pt5 


275 


104 


Pt6 


215 


49 


Pt7 


204 


88 


Pt8 


287 


88 


Pt9 


257 


102 


Ptil 


209 


46 



^Group 2 includes polymorphic sites at which only two nucleotides are present. 
"^Group 3 includes polymorphic sites at which more than two nucleotides are 
present. 

doi:l 0.1 371 /journal.pone.0087655.t001 



I design the finite-sites model in a such way that it represents the 
heterogeneity of substitution rate across nucleotide sites and infer 
some of the parameters in this model based on serial samples of 
HIV-1 DNA sequences from envelop gene region [21]. In this 
model the sequences are combinations of two regions L/, and L/ 
with high and low mutation rates, respectively. To have this 
contrast, I assume that region L;, has less nucleotides than region 
Li and mutations occur uniformly within the regions with the 
same rate 11 per generation per region. In this scenario nucleotide 
changes at mutation events occur according to the Jukes-Cantor 
model [43]. 

The population model for m serial samples is parameterized as 
follows. Let To =0 be the time of seroconversion, and the sampling 
time points since seroconversion are labeled as Ti,T2, ■ ■ ■ ,T„,, in 
chronological order, measured in years, days, or hours. Let N, and 
gi be respectively the population size and generation time for time 
interval (Tj_i,Tj). The collection of these parameters are 
represented as = (Ni , . . . ,Ni„) and g = (g\, - ■ ■ ,gm)- The sample 
size at time T; is denoted by 

Results 

Variation in a sample of DNA sequences under the above 
population model can be described by combining the genealogical 
history of the sequences with mutations on the lineages of the 
sequences. The genealogical history traces the ancestral lineages of 
the sequences back in time before time Tq = 0. I approximate this 
process by a continuous coalescent-recombination process de- 
scribed as follows. I consider Ni,i=\, . . . ,m, to be large and 
measure time in time interval (T',_i,T',) by Njgj time units and 
approximate the genealogical history of the sample in this interval 
by the ancestral recombination graph [14,15] with scaled 
recombination rate Rj = INjr. This ancestral graph is the same 
as the one derived under the Wright-Fisher model with constant 
population size N,, generation time gi, and recombination rate r 
per sequence per generation. Note that instead of using a single 
time scaling as in previous studies, I scale time in multiple time 
intervals, which gives an advantage to derive a continuous 
coalescent-recombination process for the above described popu- 
lation genetic model. 

For each interval (T)-!,?",), the tracing procedure maps to a 
continuous-time Markov chain for which time varies between 0 



and T,, Ti = (Ti — Ti_i)/{Njgi). The chain is described by the 
transition time T (0 < T < T,) and the number of lineages at time T, 
denoted by k. Initially, the values of T and k are set respectively 
equal to 0 and A:, , where fc, is the total number of sequences in two 
sets: one set includes the sampled sequences at time point Tj, the 
other set represents the sequences at time Ti that are linked to the 
lineages traced between time points 7",+ i and Ti. A possibility of 
overlap between the two sets are ignored since Ni is considered to 
be large. 

For this Markov chain the transition from the state (t,^) is 
described as follows: First, a random number x is generated from 
an exponential distribution with parameter {kRi + k(k—iy)/2. If 
x-j-T is greater than T,-, the k lineages are traced "straight" before 
time point T,-; the value of T is updated by the value of T,-, and the 
procedure for this time interval stops. Otherwise, the value of 1 is 
updated by the value of A' + t and the lineages are traced before 
time T. The set of A: sequences linked to the lineages at time t are 
modified by a coalescent event with probability 
(k—l)/{Ri + k—l) or a recombination event with probability 
Ri/(Ri + k—l). In the case of coalescent event, two lineages are 
randomly chosen out of the k lineages and merged into a single 
lineage. The set of the sequences at this time point is updated by 
replacing the two sequences of the merging lineages with a single 
sequence and decreasing the value of k by 1. Otherwise (in the 
case of recombination) creating two new sequences by randomly 
choosing one of the k sequences and one of the L—l breakpoints 
and copying left and right segments of that sequence with respect 
to that breakpoint into two sequences. The chosen sequence is 
discarded from the set of k sequences. If any of the two new 
sequences does not share an ancestral segment with the sequences 
sampled at time Ti, . . . ,Tm, that sequence is also discarded and the 
other one added to the set of k sequences. Otherwise, the two new 
sequences are added to the set of the k sequences. In the latter case 
the value of k is increased by 1 . Recursively repeating the steps 
and updating values of T and k, the procedure continuously traces 
the lineages before time t,. 

The following algorithm uses the tracing procedure recursively 
to describe a bottom up process for generating genealogical history 
of serial samples and a top down process for adding mutation 
events on the genealogy. The algorithm can be used to generate 
variation in serial samples under the population model described 
above. 

Algorithm 1 

1. Set the values of /, k, and t equal to m, «„,, 0, respectively. 

2. Apply the above procedure for tracing the lineages of the k 
sequences in the time interval (0,T,). 

3. Update the values of i, k, and t by the values of/— 1, k + iti, 0, 
respectively. 

4. As the value of / is greater than 0, go to Step 2. Otherwise the 
genealogical history of the samples is generated. Update the 
value of i by 1 

5. Add mutation events independentiy on different branches of 
the genealogical history for time interval (T',_i,T',) according 
to a Poisson process with rate equal to 0,/2,0, = 2A^,/(. At each 
mutation event, introduce mutational changes in the sequences 
according to the finite-sites model. 

6. Increase the value of / by 1. Stop if the value of/' is greater than 
m, otherwise go to Step 5. 

For the case of a non-recombining locus ('' = 0), the algorithm 
can be extended to include deterministic fluctuations in population 
size. Let N{T) be a function determining the population size at 



PLOS ONE I www.plosone.org 



5 



February 2014 | Volume 9 | Issue 2 | e87655 



A Framework with Recombination for HIV Evolution 



B 



B 



80 

EfSni(i) 60 
A 

Sni(i) 20 

o 

70 
60 

_ 50 
bdni(i) 40 
^ 30 

A 



pti /^^^VA F 





Efnni(i) 

A 

nni(i) 



40 
30 
20 
10 
0 



y^v 

Br \ pgr 



0 2 4 6 8 

sampling time (in years) 



02468 10 O 2 4 6 8 

sampling time (in years) sampling time (in years) 




Efnni(i) 

A 

nni(i) 



25 
20 
15 
10 



0 1 2 3 4 5 6 

sampling time (in years) 




0 1 234560123456 

sampling time (in years) sampling time (in years) 
Pt9 




Efnni(i) 



□ ^ 4 6 802468 10 O 2 4 6 8 

sampling time (in years) sampling time (in years) sampling time (in years) 



Figure 3. The fit of the model to the data in the finite-sites model case. In this case the population genetic model is fitted to the data by 
matching the observed values of the numbers of polymorphic sites, Siii(i}, and divergences, d„.{i), in the serial samples to their expected values, 
denoted by ]Ej5„,(/) and E^,,,!/), respectively. (A) shows the observed and fitted (expected) values of the numbers of polymorphic sites in serial 
samples. The observed and expected data points are connected by red and blue lines, respectively. (B) shows the observed and fitted (expected) 
values of the divergences in serial samples. Based on this fitting the vectors {Nigi}"'^^ and {;(/g,}''l| are estimated, and for the fitted model the 
predicted (expected) values of the average numbers of pairwise differences in serial samples are computed. (C) shows the observed and predicted 
values of this statistic in the serial samples, and the statistics are denoted by 7r„,(i) and ]Ejn„,((), respectively. 
doi:1 0.1 371 /journal.pone.0087655.g003 
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Figure 4. Observed and expected average numbers of pairwise differences between sequences at different sampling time points. 

Average number of pairwise difference between sequences in samples taken at times Tj and Tj are denoted by d„^^„.(ij). The observed and expected 
values of this statistic under the infinite-sites model and the finite-sites model are denoted by d„._„.{ij), ]Ei'^«,,;i,(',/)' snd TEfd„.^„.{iJ), respectively. (A) 
shows the observed values of di,.^„.{ij) in the serial samples for each individual's case. (B) and (C) show the predicted (expected) values of d„. „j(ij) in 
the serial samples for each individual's case computed receptively under the fitted models for the cases of the infinite-sites and finite-sites models. 
doi:1 0.1 371/journal.pone.0087655.g004 
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Figure 5. The fit of tKie model to tKie data in tlie infinite-sites model case. In this case the population genetic model is fitted to the data by 
matching the observed values of the numbers of polymorphic sites, S„.{i), and divergences, d„^{i), in the serial samples to their expected values, 
denoted by TEiS„i{i) and EAC'). respectively. (A) shows the observed and fitted (expected) values of the numbers of polymorphic sites in serial 
samples. The observed and expected data points are connected by red and blue lines, respectively. (B) shows the observed and fitted (expected) 
values of the divergences in serial samples. Based on this fitting the vectors {Mg,}"!,] and {/f/S(}"Li ^re estimated, and for the fitted model the 
predicted (expected) values of the average numbers of pairwise differences in serial samples are computed. (C) shows the observed and predicted 
values of this statistic in the serial samples and are denoted by n„.{i) and JEiiinX')^ respectively. 
doi:1 0.1 371 /journal.pone.0087655.g005 
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Table 2. The overall-fit scores of the population genetic 
models to the data sets from each individual. 





Individual 


score 1* 


score 2*' 


score 3"^ 


Ptl 


119 


184 


17063 


Pt2 


89 


246 


15603 


Pt3 


70 


102 


1529 


Pt5 


90 


143 


40184 
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37 
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6221 


Pt9 
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383 


154575 


Ptll 


23 


20 


28160 


*^ln the fitting process the observed numbers of polymorphic sites and 
divergences in the serial samples are matched to their expected values under 



the population genetic model in the finite-sites model setting. 
"^The fitting process is the same as In case of score 1 except that the infinite- 
sites model is considered. 

'^The overall-fit scores are computed in the content of the four statistics from a 
model that is estimated by matching the observed numbers of polymorphic 
sites and average numbers of pairwise differences in the serial samples to their 
expected values under the model in the infinite-sites setting. 
doi:1 0.1 371 /journal.pone.0087655.t002 

time T, Tq<T < T^, . Assuming that it is a continuous function in 
each of the intervals {Ti-\,Ti),i= 1, . . . ,m, and its limit at T-, from 
left is denoted by A*",, which is N{Ti — )=\\raT^Ti.T<TiN{T). 
Based on this notations and the transformation functions A,(t) = 
£ Ni/N{Ti-giNiu)du, 0<M,T<T,, defined by [49], I derive the 
following algorithm by modifying Algorithm 1 . 

Algorithm 2 

1. Set the values of/, k, and t to be equal to m, «,„, 0, respectively. 

2. Apply the above procedure for tracing the lineages of the k 
sequences for time interval (T',- 1 , J/) by generating the Markov 
chain for time interval (0,T,), where T/ = A,(t,). 

3. Update the values of i, k, and t by the values of i— 1, k + iij, 0, 
respectively. 

4. If the value of; is greater than 0 go to Step 2. Otherwise update 
the value of / by 1 . 

5. Transform the branch lengths of the generated genealogical 
history for time interval (0,T,) by applying the inverse of the 
function A,(-) to the coalescence waiting times in that part of 
the genealogy. (For example, if the part of the generated 
genealogical history for time interval (0,T,) starts with k 
lineages and is the waiting time for the number of the 
lineages to decline to j first time, then the corresponding 
coalescence time cj in the variable population size case is 

determined by the equation (fJ = Aj"'((fy), where A,^'(-) is the 
inverse of A,(-). 

6. Add mutation events independently on different branches of 
the part of the transformed genealogy according to a Poisson 
process with rate equal to 9j/2,6i = 2Njf.i. At each mutation 
event, introduce mutational changes in the sequences accord- 
ing to the finite-sites model. 

7 . Increase the value of / by 1 . Stop the process if / is greater than 
m, otherwise go Step 5. 



In the case of a non-recombining locus the process described in 
Algorithm 1 for constructing the genealogical history of n 
sequences sampled at time T„ can be simplified: instead of 
considering m tracing procedures for time intervals 
(0,T,),/= 1, . . . ,m, the genealogy of the sample can be constructed 
by a single tracing procedure for time interval (0,/), where / is 
equal to Ym=i ''^i = iTi — Tj-\)/{Nigi). This simplification is a 
result of the fact that the waiting times to the coalescent events in 
the tracing process are exponential random variables and satisfy 
the memoryless property. 

One implication of this result is that it allows to derive an 
analytical formula for the expected number of polymorphic sites in 
a sample of DNA sequences drawn randomly from the piecewise 
constant population size model combined with the infinite-sites 
model. Another analytical expression for the same quantity was 
also derived by Tajima [37] using recursion formulas for expected 
numbers of polymorphic sites. 

Lemma 1 Let S„{Ti) be the number of polymorphic sites in a sample of 
n DMA sequences drawn from the above population model at time T,. The 
mean of S„(Ti) can be computed by using the formula 

E5„(r,)=^^|<^^(T,)F(«,rf, A' 

j=\d = 2^ \ k=J+\ ) 

where Xj = (Tj— Tj-\)/(Njgj)j= 1, . . . ,m. The fiinctional (l>x{^) w the 
expected total branch length of the genealogy constructed by tracing x sequences 
according to the above procedure starting at 0 and ending at the latest time point 
before z and the most recent common ancestor; p(x,j',z) £f the probability that 
X sequences traced according to the described procedure have y ancestors at time 
z. Analytical expressions for ^^(f) and '^{x,y,z) were derived by [37 ,50- 
52] and [49], respectively. 

The proof of the lemma is in Appendix S 1 . 

Note that Tajima [37] derived the formula for WiS„{Ti) 
including the non-homogeneous population case at time Tq. To 
include that case in the formula (2), the expression 

'Y^"d=2^'iW{n,d,^'j^ii^ should be added to the right of 

equation (2). The quantity represents the expected number of 
polymorphic sites in a sample of d sequences drawn from the 
population at time To . Particularly, if the population before time 
To is modeled according to the Wright-Fisher model with constant 
population size, Aq, then e,/ is equal to 9o X]f=i' V'' according to 
Watterson's formula [23], in which 6o = 2Nol.i. 

The formula (2) also holds for the case of r>0 because the 
genealogies at the non-recombining parts of the sequences are 
identically distributed. Note that this formula also allows to 
compute the expected average number of pairwise differences in 
samples because that quantity is equal to the expected number of 
polymorphic sites in randomly chosen two sequences. Note that 
this formula can also be apphed for computing the expected 
average number of pairwise differences between two sequences 
sampled at two different time points. Let and sj be DNA 
sequences drawn from the above population model at times T, and 
Tj,j<i, respectively; and d{Si,Sj) is the number of sites at which 
the sequences Si and Sj differ. The expected value of d{si,Sj) can be 
computed by using the formula 

I 

JE.d(si,sj)=JE.S2{Ti)+ KTk-Tk^,)/gk. (3) 
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Figure 6. The dynamics of pD Q and />D'1 for the serial samples in each individual's case. For the serial samples from each of the 
individuals, the observed values of/;/) 0 and pD 1 as well as their expected values are plotted with respect to the sampling times. The observed data 
points are connected by red lines. For each of the values of /'//i equal to 0, 1, 10, 50, 100, and 200, the expected values of these two statistics are 
computed by using Monte Carlo approach and Algorithm 1 based on the estimated values of the vectors {A'/g,}"L| and {/.i/Si};=i f°r the finite-sites 
case. 

doi:1 0.1 371 /journal.pone.0087655.g006 
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r/M=200 




0 2 
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Figure 7. The 95% probability intervals for 0 and pL) 1 in the case of individual Pti. The observed values of the statistics pEiO and pL) 1 
at sampling time points are connected by red lines. For each of the values of r/ p and at each sampling time point the 95% probability interval is 
inferred by estimating 2.5% and 97.5% quantiles of the statistics under the estimated model in the finite-sites case. The vertical intervals at the 
sampling time points represent the 95% probability intervals in green, black, and orange when r/ p is 0, 1 , or 200, respectively. In each case the same 
colors are respectively used to connect the expected values of the statistics. 
doi:1 0.1 371/journal.pone.0087655.g007 



Application of the framework 

I apply the models and methods described in the previous 
section for exploring within-host HIV evolution based on 
serial samples of HIV DNA sequences from 9 HIV- 1 -infected 
individuals studied by [21]. As in that study, I also use the same 
identifiers for the 9 individuals: Ptl, Pt2, Pt3, Pt5, Pt6, Pt7, Pt8, 
Pt9, Ptll. The serial samples were based on blood specimens 
provided by the individuals at their semiannual visits since 
seroconversion. The sequences are from C2-V5 region of the 
HIV-1 envelop gene and are 764 bases long including nucleotides 
and insertion/deletions based on the alignments of aU the 
sequences. 

First, I fit the population genetic model to data sets under the 
finite-sites model, in which the values of Lh and Li are estimated 
from the alignments of all the sequences in the serial samples for 
each individual's case by classifying the nucleotide sites into three 
groups. Group 1 includes nucleotide sites that are conserved; 
group 2 represents sites at which only two nucleotides are present; 
the rest of the sites are in group 3: sites at which more than 2 
nucleotides are present. Table 1 shows the sizes of the groups 2 
and 3 that determine the values of L/ and L/,, respectively. 

I implemented Algorithm 1 for this mutation model into a 
computer program (in C programming language) for generating 
the polymorphisms in serial samples and applying the Monte 
Carlo approach to estimate the expected values of summary 
statistics for the observed sample sizes. To fit the model to the data 
sets for each individual's case, I consider two summary statistics: 
the numbers of polymorphic sites and divergences in the serial 
samples. Divergence in a sample of sequences is defined as the 
average of the numbers of differences between the founder 
sequence and the sequences in the sample. The founder sequence 
is the sequence of the homogeneous population at the beginning; 
and for the observed samples, the founder sequence is inferred 
from the alignment of the sequences in the first sample taken (at 



time T\) after seroconversion. The most frequent nucleotide at 
each nucleotide site in the alignment of the sequences in that 
sample is defined as the nucleotide of the founder sequence. To 
estimate the parameters {Ni}"'^^ and }''!], I recursively match 
the observed values of the two statistics to their closest expected 
values. Thus, I first estimate the values of N\g\ and n/g\ and then 
recursively estimate the other elements of the vectors {Af(g;}"Li 
and {A'/gi}"Li. The parameter vectors {A^,}"ij and are 
estimated up to a constant factor ^i. Figure 3 shows the fitted 
dynamics of the expected values of the two statistics to then- 
observed values in the serial samples. 

To assess the overfitting of the estimated model to the data, I 
consider two additional statistics: the average numbers of pairwise 
differences between and within the serial samples. Figures 3 and 4 
show the observed and expected dynamics of the two statistics, in 
which the expected (predicted) values of the two statistics are 
estimated under the fitted model. The statistics are used as 
controls, and the overall fit of the estimated model in each 
individual's case is quantified by the overall-fit score defined as 
follows. The observed and expected dynamics of the four statistics 
(computed under the fitted model) in the serial samples are 
represented as vectors of numbers and the overall-fit score is 
defined as the sum of the Euchdean distances between the 
observed and expected vectors. Thus, the overall-fit scores carry 
the tread-off between fit and prediction for the estimated models in 
the content of the four statistics. 

I also fit the model to the data sets under the infinite-sites model 
and compare the mimicking powers of the two models with respect 
to the data sets by using graphical assessment as well as using the 
overall-fit scores. The estimation procedure is the same as in the 
previous case except that the analytical formulas (2) and (3) are 
used for computing the expected values of the four statistics. In this 
case the estimated expected values of the four statistics do not 
show strong qualitative discrepancy with their observed values 
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except in the case of individual Pt9 (see Figure 5 and 4). Table 2 
shows the overall-fit scores of the two models for each individual's 
case. Thus, the estimated finite-sites model shows better qualitative 

and quantitative mimicking power in each individual's case. 

The contra.st between the mimicking powers of the two 
mutation models in the content of the four statistics is more 
obvious when I fit the population genetic model to the data sets 
under the infinite-sites model by matching the observed values of 
the numbers of polymorphic sites and average numbers of pairwise 
differences in the serial samples to their expected values, and I use 
the other two statistics as controls. The overall-fit scores in this 
case are also in Table 2. Note that in this case the expected 
(predicted) values of divergence over time increase much faster 
and have bigger values than the observed values (details not 
shown) which are less than 100, but the sequences are about 764 
bases long. Intuitively, such discrepancy can be controlled by 
considering a finite-sites model in which sequence length is much 
smaller than 764 bases, and this intuition was one of the reasons 
for considering the finite-sites model descried in this paper. 

The signature of recombination on the dynamics of 
within-host HIV genetic diversity 

I use the estimated models for the case of the finite-sites model 
(described in the previous section) to explore the signature of 
recombination on the dynamics of HIV genetic diversity. For this 
purpose I choose two statistics based on the linkage disequilibrium 
measure D [53,54] since the expected values of the four statistics 
described in the previous section are independent on the 
recombination rate. The two statistics are denoted by pD^ and 
/)Z)j and described as follows. To determine the values of these 
statistics for a sample of DNA sequences, I first compute D for all 
pairs of polymorphic sites in the sample by using the formula 14 by 
[54], which extends the definition of Z) for polymorphic sites with 
multiple alleles. The statistics pD^ and pD^ represent the 
proportion of the computed D that are equal to 0 and 1, 
respectively. Intuitively, one can expect pD^ and 1 —pD^ to 
decrease as r increases since recombination breaks down linkage 
disequilibrium between nucleotide sites. For the case of two 
biaUelic linked polymorphic sites created by two mutations, Z) is 0 
when the sites are in equilibrium (r= oo) and it is 1 or greater than 
0 for completely linked sites {r = Q). 

Using the computer program based on Algorithm 1 and Monte 
Carlo approach, I estimated the expected values of pD^^ and pD^ 
in the serial samples under the estimated models for each 
individual's case and for each of the values of r/ ji to be equal to 
0, 1, 10, 50, 100, 200. The locus length / in the simulations is 10 
and the numbers of the loci in the regions and Li are 
respectively equal to \ Lhl^\ and \_Li/l\\ [_x\ is the greatest integer 
number that is less than x. To avoid numerical problems, the 
values of the two statistics in the simulations are determined by the 
conditions /)'< 0.0000001 and Z)' > 0.9999999 instead of Z)'=0 
and D =\, respectively. The expected and observed dynamics of 
the two statistics in the serial samples for each individual's case are 
in Figure 6. The results show the following interesting common 
trends for the dynamics of the two statistics: (1) For most of the 
individual cases, the obser\'ed dynamics of the two statistics 
fluctuate between the expected dynamics of the two statistics for 
rj^ equal to 0 and 1. (2) Some of the observed values of the two 
statistics are not expected for any of the considered values otr/fi, 
which is particularly obvious for the cases of the individuals Ptl 
and Pt6. 



In the case of patient Ptl, I explore further and consider three 
hypotheses for : it is equal to 0, 1, or 200. I consider each of 
the hypotheses as a null hypothesis, and test them for data sets at 
the sampling time points by estimating 2.5% and 97.5% quantiles 
of each of the statistics pD^ and pD^ based on the estimated 
model. Figure 7 shows that each of the hypotheses is rejected at a 
5% significance level for some of the data sets at the sampling time 
points. These trends in the dynamics of the two statistics can be 
taken as a consequence of selection pressure on the HIV-l envelop 
gene region. 

Discussion 

Some modeling issues 

The purpose of this study was to develop a computationally and 
statistically tractable framework, including recombination, for 
analyzing the dynamics of HIV genetic diversity in HIV-infected 
individuals. To derive this framework, I first designed a population 
genetic model that carries some of the features of within-host HIV 
evolution. Particularly, the model includes recombination, vari- 
ability in population size and generation time, and heterogeneity 
of mutation rate across nucleotide sites. In addition, I considered 
the population size and generation time to vary over time as 
piecewise constant functions of time; these choices were made in 
order to derive the framework including recombination and 
without overwhelming the model with parameters that would be 
difficult to estimate. 

In spite of these choices, the model and framework can be 
extended for other evolutionary settings. Particularly, the model 
can be extended to include various distributions for the break- 
points along HIV genome, as well as for mutation rates across 
nucleotide sites. As another extension of the framework, I 
described Algorithm 2 that is applicable for serial samples of 
non-recombining sequences in a more general demographic 
scenario by allowing the population size to be a piecewise 
continuous function of time. Such a demographic scenario may be 
more suitable for exploring evolutionary dynamics of other 
pathogens at genomic level in vitro and in vivo. Particularly 
in \'itro experiments in which a bacterial population goes through 
recurrent bottlenecks by growing or declining exponentially over 
time. However, in this setting the number of the parameters can 
increase and can be challenging to estimate them. For example, if 
the population size N(T) declines or grows exponentially on 
intervals {Tj_\,Tj), i=\, . . . ,m, as a function of time T, the left 
and right limits iV(r, -) and iV(r, + ) of iV(r) at T, determine 
this function. Thus, they become parameters of the model, and the 
total number of parameters can increase at most by m, in 
comparison to the piecewise constant population size case. 
Exception is the case when N{T) stays constant on interval 
{Tq,Ti) and changes continuously on {T\,T„^ (that is 
N(T,-) = NiTi + )). 

Another extension of the model is to replace the assumption of 
homogeneous population at time To by considering the population 
to evolve according to the equilibrium Wright-Fisher model before 
time Tq. For this case Algorithm 1 should be modified by allowing 
the lineages of the samples to be traced back in time before the 
most recent common ancestor. In this setting the genealogical 
process for a sample of non-recombining DNA sequences is the 
same as the genealogical process in the standard coalescent 
because the waiting times between consecutive coalescent events 
are exponential random variables which have the memoryless 
property. However, mutation events on the lineages of a such 
genealogy are added according to an inhomogeneous Poisson 
process with rate equal to 0i/2 = NiH for time interval 
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i = 0,...,m, where ?_i = — oo, to = 0, and ti=Yl'k=\'^- This 
presentation is consistent with the statement provided by [31], and 
shows the robustness of the genealogical process in the standard 
coalescent, which was also observed in other evolutionary settings 
[15,24,25]. 

Note that the developed framework can be applied to generate 
HIV transmission chains and HIV epidemics at the genomic level. 
To be able to accomplish such a task, it is important to have a 
better understanding of the space of the \alues of the 
vectors{iV,g,} and {yu/g",}. The moment matching approach used 
in this paper has limited power to assess uncertainties in the 
estimates of these vectors, this challenge can be overcome with 
more computational cost by incorporating a rejection method [55] 
within the developed framework. Such a method might also 
overcome the fitting limitations of the moment matching method 
as I observed that the expected dynamics of divergence from the 
founder sequence show non-decreasing behavior over time but the 
observed dynamics of this statistic show some fluctuations (see 
Figures 3 and 5). 

The signatures of recombination and selection on the 
dynamics of within-host HIV genetic diversity 

The application of the framework to the serial samples of HIV 
DNA sequences from nine HIV infected individuals allowed to 
explore the fit of the model to the data sets and the impact of the 
recombination on the dynamics of within-host HIV genetic 
diversity. Particularly, these results show large variability for 
inferring the r/fi ratio (recombination rate over mutation rate) at 
dilTerent sampling time points (Figures 6). These results are 
consistent with other studies (see [56] and references therein) that 
also obser\'ed very wide range for estimates of recombination rate 
in various viruses. Therefore, it is not clear if a particular estimated 
value of r/ fi can be taken as a representative value. For example, 
in the case of individual Ptl the ratio at different sampling 
time points can be inferred to be 0, 1, or 200, and in the meantime 
Figure 7 shows that for each of these values there is a time point at 
which the estimated 95% probability intervals of the statistics pD^^ 
and pD^ exclude the observed values of these statistics. 

The wide spectrum of the inferred values of r//^ also explains 
the contrasting estimates of the ratio by other studies. Resent 
studies [38,39] used serial samples of HIV DNA sequences and 
inferred within-host HIV recombination rate to be smaller than 
the mutation rate. In contrast, Shriner et al [16] inferred within- 
host HIV recombination rate to be 5.5 fold greater than point 
mutation rate; they derived the results based on a sample of HIV 
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